The given dataset is a set of approved loans made through the lending club platform.
The dataset contains 10,000 loans that were issued during the months of January, February and March of 2018.
It contains a wide variety of attributes pertaining to the loan applicant's financial status and credit history as well as details of the particular loan that was approved.
The dataset is a snapshot at a particular point of time which is unknown.
The dataset can be broadly divided into 6 different categories of variables:
Demographic Data - Contains data pertaining to the nature of employment of the borrower, the state of the borrower and the type of home ownership (rented/owned/mortaged).
Financial Wellbeing - Contains data related to the income of the borrower and the debt to income ratio, which are indicators of the borrower's financial ability to repay a loan, if taken.
Credit History (Delinquency) - Data points pertaining to the borrower's repayment history, specifically signs of failure to repay. Covers the number of delinquencies, severity of the delinquencies and the time since last delinquency.
Credit History (Tradelines) - Broadly covers number, nature and credit amount of tradelines taken by the borrower and inquiries (indicating credit hungriness of the borrower).
Loan Specifications - Details on the particular loan that was approved (Loan Amount, Tenure, Interest Rate etc.)
Loan Status - Pertains to the particular snapshot of data. At the given point of time, what is the outstanding balance and amounts paid by the borrower till then.
Here's a list of all variables classified under each category:

The dataset contains data points pertaining to a very small period (Jan'18 - Mar'18). The consequence of this is an inability to perform Out Of Time validations on any predictions that are made. Since the data present covers a very small window, it might be possible that any model trained on this data might not generalize well to other time periods as the nature of the financial ecosystem might vary with time.
The snapshot time of the dataset is unclear. There are variables like balance and paid total, which seem to be populated at an unknown point of time after the loan is approved. There are also variables like accounts_opened_24m or months_since_last_credit_inquiry which are more likely captured at the point of loan approval, but could also be data pertaining to the point of time of the snapshot. (For all future analysis we will assume that this kind of data was captured right before the application was approved).
The dataset contains a mix of numeric and categorical data. Categorical variables need to be encoded in order for most predictive models to be able to process them.
The numeric data is not standardized. For many machine learning algorithms that use vector distances (Eg. Gradient Descent, KNN) and some EDA algorithms (Eg. PCA) data needs to be standardized for optimal functioning.
Missing Values - While the dataset is fairly dense for the most part, missing values in variables like emp_length, num_accounts_120d_past_due need to be treated before they are used.
There is a lack of clarity on how certain variables are calculated. For instance, the debt_to_income ratio seems alarmingly high for most of the customers!
#Import Plotly for visualizations
import plotly.graph_objects as go
import plotly.express as px
import plotly.offline as pyo
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
import plotly
plotly.offline.init_notebook_mode(connected=True)
import pandas as pd
import numpy as np
#Read dataset
loans_df = pd.read_csv('loans_full_schema.csv')
To better understand the quality of the loan portfolio, we can look at the number of loans across each grade and the risk associated with each loan. To evaluate the risk, we define a metric ever_delq_2y as ever delinquent in the last 2 years i.e. if a customer has been delinquent at least once in the last 2 years.
loans_df = loans_df.assign(ever_delq_2y = np.where(loans_df['delinq_2y'] > 0, 1, 0))
loans_df['ever_delq_2y'] = loans_df['ever_delq_2y']*100
grade_risk = loans_df.groupby(['grade']).agg({'ever_delq_2y':'mean'}).reset_index()
grade_count = loans_df.groupby(['grade']).agg({'ever_delq_2y':'count'}).reset_index()
fig = make_subplots(specs=[[{"secondary_y": True}]],x_title='Grade')
fig.add_trace(go.Scatter(x=grade_risk.grade,y=grade_risk.ever_delq_2y,name="Risk % (R)"),secondary_y=True)
fig.add_trace(go.Bar(x=grade_count.grade,y=grade_count.ever_delq_2y,name="Number of Loans (L)"),secondary_y=False)
fig.update_layout(height=600, width=1000, title_text="Number of Loans and Risk by Grade")
It is hypothesized that customers who prefer digital channels for transactions are less risky than customers who opt for cash.
Here we compare the risk of borrowers who go for cash disbursement vs DirectPay.
loans_df = loans_df.assign(ever_delq_2y = np.where(loans_df['delinq_2y'] > 0, 1, 0))
loans_df['ever_delq_2y'] = loans_df['ever_delq_2y']*100
disb_risk = loans_df.groupby(['disbursement_method']).agg({'ever_delq_2y':'mean'}).reset_index()
disb_count = loans_df.groupby(['disbursement_method']).agg({'ever_delq_2y':'count'}).reset_index()
fig = make_subplots(specs=[[{"secondary_y": True}]],x_title='Disbursement Method')
fig.add_trace(go.Scatter(x=disb_risk.disbursement_method,y=disb_risk.ever_delq_2y,name="Risk % (R)"),secondary_y=True)
fig.add_trace(go.Bar(x=disb_count.disbursement_method,y=disb_count.ever_delq_2y,name="Number of Loans (L)"),secondary_y=False)
fig.update_layout(height=600, width=1000, title_text="Number of Loans and Risk by Disbursement Method")
fig.update_yaxes(range=[0, 15],secondary_y=True)
fig.update_yaxes(range=[0, 10000],secondary_y=False)
The income of a borrower is one of the primary factors in gauging the ability of a borrower to repay their loan. Targeting states with a higher median income would be a preferable sales strategy for lenders to procure credit-worthy customers.
While one could use census data for the same information, Lending Club data would filter out population with a higher credit score, and the trends in income might be different than a more generalized estimate.
Below, we look at a visualization of state-wise median income.
geo_income = loans_df.groupby('state').agg({'annual_income': 'median'}).reset_index().round(-2)
px.choropleth(geo_income,
locations = 'state',
color="annual_income",
color_continuous_scale="Aggrnyl",
locationmode='USA-states',
scope="usa",
range_color=(50000, 75000),
title='Median Annual Income by State (Hover for data points)',
height=600
)
Box plots are a quick way of representing the distributions of variables and comparing distributions across different categories.
Here we can quickly analyse how the disbursed loan amounts and interest rates vary for different purposes of loans.
fig = px.box(loans_df, x="loan_purpose", y="loan_amount")
fig.show()
fig2 = px.box(loans_df, x="loan_purpose", y="interest_rate")
fig2.show()
Here we look at the distribution of number of months since the last credit inquiry.
fig = ff.create_distplot([loans_df[loans_df.months_since_last_credit_inquiry.notna()].months_since_last_credit_inquiry],group_labels=['months_since_last_credit_inquiry'], bin_size=2)
fig.show()
loans_df = pd.read_csv('loans_full_schema.csv')
loans_df.count()
emp_title 9167 emp_length 9183 state 10000 homeownership 10000 annual_income 10000 verified_income 10000 debt_to_income 9976 annual_income_joint 1495 verification_income_joint 1455 debt_to_income_joint 1495 delinq_2y 10000 months_since_last_delinq 4342 earliest_credit_line 10000 inquiries_last_12m 10000 total_credit_lines 10000 open_credit_lines 10000 total_credit_limit 10000 total_credit_utilized 10000 num_collections_last_12m 10000 num_historical_failed_to_pay 10000 months_since_90d_late 2285 current_accounts_delinq 10000 total_collection_amount_ever 10000 current_installment_accounts 10000 accounts_opened_24m 10000 months_since_last_credit_inquiry 8729 num_satisfactory_accounts 10000 num_accounts_120d_past_due 9682 num_accounts_30d_past_due 10000 num_active_debit_accounts 10000 total_debit_limit 10000 num_total_cc_accounts 10000 num_open_cc_accounts 10000 num_cc_carrying_balance 10000 num_mort_accounts 10000 account_never_delinq_percent 10000 tax_liens 10000 public_record_bankrupt 10000 loan_purpose 10000 application_type 10000 loan_amount 10000 term 10000 interest_rate 10000 installment 10000 grade 10000 sub_grade 10000 issue_month 10000 loan_status 10000 initial_listing_status 10000 disbursement_method 10000 balance 10000 paid_total 10000 paid_principal 10000 paid_interest 10000 paid_late_fees 10000 dtype: int64
loans_df['emp_title'] = np.where(loans_df['emp_title'].isnull(),'other',loans_df['emp_title'])
loans_df['emp_length'] = np.where(loans_df['emp_length'].isnull(),0,loans_df['emp_length'])
loans_df[loans_df['debt_to_income'].isnull()][['annual_income','debt_to_income','annual_income_joint','debt_to_income_joint']]
| annual_income | debt_to_income | annual_income_joint | debt_to_income_joint | |
|---|---|---|---|---|
| 154 | 0.0 | NaN | 167250.0 | 14.92 |
| 927 | 0.0 | NaN | 28000.0 | 5.61 |
| 1255 | 0.0 | NaN | 150000.0 | 13.81 |
| 1330 | 0.0 | NaN | 84000.0 | 17.43 |
| 1653 | 0.0 | NaN | 74000.0 | 30.63 |
| 1718 | 0.0 | NaN | 128000.0 | 25.95 |
| 1845 | 0.0 | NaN | 134000.0 | 20.52 |
| 2106 | 1.0 | NaN | 74001.0 | 20.33 |
| 2546 | 0.0 | NaN | 88265.0 | 11.91 |
| 2556 | 0.0 | NaN | 77000.0 | 13.73 |
| 3429 | 0.0 | NaN | 36000.0 | 24.10 |
| 4123 | 0.0 | NaN | 127000.0 | 15.97 |
| 5032 | 0.0 | NaN | 95000.0 | 18.61 |
| 5300 | 0.0 | NaN | 90000.0 | 8.92 |
| 5358 | 0.0 | NaN | 75004.0 | 12.88 |
| 5388 | 0.0 | NaN | 79000.0 | 27.42 |
| 5432 | 0.0 | NaN | 80000.0 | 26.93 |
| 6059 | 0.0 | NaN | 84500.0 | 30.32 |
| 7010 | 0.0 | NaN | 450000.0 | 1.00 |
| 7149 | 0.0 | NaN | 151000.0 | 6.07 |
| 7155 | 0.0 | NaN | 107000.0 | 27.72 |
| 8678 | 0.0 | NaN | 120000.0 | 26.27 |
| 9011 | 0.0 | NaN | 150000.0 | 13.68 |
| 9273 | 0.0 | NaN | 170000.0 | 7.69 |
loans_df['debt_to_income'] = np.where(loans_df['debt_to_income'].isnull(),loans_df['debt_to_income_joint'],loans_df['debt_to_income'])
loans_df['months_since_last_delinq'] = np.where(loans_df['months_since_last_delinq'].isnull(),999,loans_df['months_since_last_delinq'])
loans_df['months_since_90d_late'] = np.where(loans_df['months_since_90d_late'].isnull(),999,loans_df['months_since_90d_late'])
loans_df[loans_df['num_accounts_120d_past_due'].isnull()][['num_accounts_120d_past_due','num_accounts_30d_past_due']].groupby('num_accounts_30d_past_due').size()
num_accounts_30d_past_due 0 317 1 1 dtype: int64
loans_df['num_accounts_120d_past_due'] = np.where(loans_df['num_accounts_120d_past_due'].isnull(),0,loans_df['num_accounts_120d_past_due'])
loans_df['months_since_last_credit_inquiry'] = np.where(loans_df['months_since_last_credit_inquiry'].isnull(),999,loans_df['months_since_last_credit_inquiry'])
loans_df = loans_df.drop(['verified_income',
'verification_income_joint',
'term',
'installment',
'grade',
'sub_grade',
'issue_month',
'loan_status',
'initial_listing_status',
'disbursement_method',
'balance',
'paid_total',
'paid_principal',
'paid_interest',
'paid_late_fees'],axis=1)
loans_df['annual_income_max'] = loans_df[['annual_income', 'annual_income_joint']].max(axis=1)
loans_df['debt_to_income_min'] = loans_df[['debt_to_income', 'debt_to_income_joint']].min(axis=1)
loans_df = loans_df.drop(['annual_income_joint','debt_to_income_joint'],axis=1)
loans_df['utilization_ratio'] = np.where(loans_df['total_credit_limit']==0,0,loans_df['total_credit_utilized']/loans_df['total_credit_limit'])
loans_df = loans_df.drop(['total_credit_utilized'],axis=1)
loans_df['vintage_earliest_credit_line'] = 2018-loans_df['earliest_credit_line']
loans_df = loans_df.drop(['earliest_credit_line'],axis=1)
import sklearn
from sklearn.model_selection import train_test_split
train, test = train_test_split(loans_df, test_size=0.25)
def calc_smooth_mean(df, var, target, m):
#Compute the global mean
mean = df[target].mean()
#Compute the number of values and the mean of each group
agg = df.groupby(var)[target].agg(['count', 'mean'])
counts = agg['count']
means = agg['mean']
#Compute the "smoothed" means
smooth = (counts * means + m * mean) / (counts + m)
#Replace each value by the according smoothed mean
return df[var].map(smooth)
train['emp_title'] = calc_smooth_mean(train, var='emp_title', target='interest_rate', m=300)
train['state'] = calc_smooth_mean(train, var='state', target='interest_rate', m=300)
train['homeownership'] = calc_smooth_mean(train, var='homeownership', target='interest_rate', m=300)
train['loan_purpose'] = calc_smooth_mean(train, var='loan_purpose', target='interest_rate', m=300)
train['application_type'] = calc_smooth_mean(train, var='application_type', target='interest_rate', m=300)
test['emp_title'] = calc_smooth_mean(test, var='emp_title', target='interest_rate', m=300)
test['state'] = calc_smooth_mean(test, var='state', target='interest_rate', m=300)
test['homeownership'] = calc_smooth_mean(test, var='homeownership', target='interest_rate', m=300)
test['loan_purpose'] = calc_smooth_mean(test, var='loan_purpose', target='interest_rate', m=300)
test['application_type'] = calc_smooth_mean(test, var='application_type', target='interest_rate', m=300)
len(train.columns)
40
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(random_state=1, max_depth=10)
train1 = train.drop(['interest_rate'],axis=1)
model.fit(train1, train['interest_rate'])
importance = model.feature_importances_
indices = np.argsort(importance)[-20:]
features = train1.columns
train[[features[i] for i in indices]]
| num_active_debit_accounts | loan_purpose | num_total_cc_accounts | num_cc_carrying_balance | annual_income_max | total_credit_lines | state | annual_income | account_never_delinq_percent | months_since_last_delinq | accounts_opened_24m | total_credit_limit | vintage_earliest_credit_line | months_since_last_credit_inquiry | debt_to_income | debt_to_income_min | loan_amount | utilization_ratio | total_debit_limit | emp_title | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3700 | 4 | 12.972777 | 15 | 8 | 155000.0 | 22 | 12.299349 | 70000.0 | 95.2 | 62.0 | 13 | 94282 | 20 | 6.0 | 24.64 | 23.06 | 7200 | 0.489139 | 33700 | 12.428136 |
| 5646 | 2 | 11.606437 | 6 | 2 | 42000.0 | 10 | 12.310444 | 42000.0 | 100.0 | 999.0 | 1 | 82774 | 18 | 999.0 | 35.20 | 35.20 | 8000 | 0.388600 | 15500 | 12.405345 |
| 588 | 8 | 12.397012 | 14 | 12 | 100000.0 | 29 | 12.235290 | 100000.0 | 89.3 | 26.0 | 10 | 775274 | 14 | 6.0 | 18.18 | 18.18 | 15000 | 0.070725 | 41600 | 12.416009 |
| 1725 | 3 | 12.972777 | 10 | 4 | 80000.0 | 18 | 12.401483 | 45000.0 | 94.4 | 45.0 | 4 | 44400 | 13 | 18.0 | 11.97 | 11.97 | 15000 | 0.394707 | 37300 | 12.472108 |
| 6053 | 1 | 12.972777 | 5 | 4 | 99000.0 | 21 | 12.300419 | 99000.0 | 100.0 | 999.0 | 5 | 122869 | 11 | 3.0 | 9.84 | 9.84 | 6000 | 0.563177 | 2500 | 12.419265 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7266 | 1 | 12.972777 | 2 | 2 | 60000.0 | 5 | 12.522707 | 60000.0 | 100.0 | 999.0 | 1 | 55557 | 10 | 1.0 | 19.31 | 19.31 | 32000 | 0.408121 | 31000 | 12.414381 |
| 1222 | 6 | 12.972777 | 19 | 10 | 209000.0 | 39 | 12.299349 | 209000.0 | 94.9 | 52.0 | 7 | 803161 | 19 | 8.0 | 19.76 | 19.76 | 24000 | 0.171052 | 45200 | 12.410534 |
| 4688 | 4 | 11.606437 | 15 | 7 | 91000.0 | 24 | 12.299349 | 91000.0 | 100.0 | 999.0 | 5 | 53640 | 15 | 0.0 | 12.67 | 12.67 | 12000 | 0.566779 | 17700 | 12.419232 |
| 4335 | 2 | 12.972777 | 9 | 2 | 65000.0 | 17 | 12.522707 | 65000.0 | 100.0 | 999.0 | 2 | 263690 | 11 | 13.0 | 12.92 | 12.92 | 20000 | 0.058971 | 20000 | 12.415976 |
| 1609 | 1 | 12.972777 | 5 | 3 | 60000.0 | 12 | 12.407420 | 40000.0 | 100.0 | 999.0 | 2 | 176680 | 11 | 0.0 | 31.35 | 21.40 | 14400 | 0.204737 | 800 | 12.378779 |
7500 rows × 20 columns
train[[features[i] for i in indices]].corr()
| num_active_debit_accounts | loan_purpose | num_total_cc_accounts | num_cc_carrying_balance | annual_income_max | total_credit_lines | state | annual_income | account_never_delinq_percent | months_since_last_delinq | accounts_opened_24m | total_credit_limit | vintage_earliest_credit_line | months_since_last_credit_inquiry | debt_to_income | debt_to_income_min | loan_amount | utilization_ratio | total_debit_limit | emp_title | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| num_active_debit_accounts | 1.000000 | -0.019688 | 0.503372 | 0.846123 | 0.127477 | 0.364356 | -0.006579 | 0.151653 | 0.113909 | 0.032277 | 0.270173 | 0.175165 | 0.128693 | -0.063769 | 0.142747 | 0.224159 | 0.161300 | -0.025012 | 0.521882 | -0.027011 |
| loan_purpose | -0.019688 | 1.000000 | 0.013988 | 0.004128 | 0.010461 | 0.019603 | 0.021522 | -0.005635 | -0.012790 | -0.022492 | 0.032983 | 0.010004 | 0.030755 | -0.017703 | 0.050849 | 0.042319 | 0.084673 | -0.013562 | -0.048456 | 0.044581 |
| num_total_cc_accounts | 0.503372 | 0.013988 | 1.000000 | 0.616515 | 0.112257 | 0.773013 | 0.003598 | 0.117078 | 0.050700 | -0.112417 | 0.433478 | 0.226036 | 0.311603 | -0.131354 | 0.131532 | 0.189281 | 0.146379 | -0.171946 | 0.440860 | 0.002168 |
| num_cc_carrying_balance | 0.846123 | 0.004128 | 0.616515 | 1.000000 | 0.105447 | 0.456368 | 0.016815 | 0.119490 | 0.090927 | -0.024260 | 0.351856 | 0.166626 | 0.170033 | -0.087971 | 0.189834 | 0.285010 | 0.128768 | -0.018738 | 0.362143 | 0.015063 |
| annual_income_max | 0.127477 | 0.010461 | 0.112257 | 0.105447 | 1.000000 | 0.200961 | -0.034401 | 0.911938 | -0.008154 | -0.062871 | 0.073402 | 0.541086 | 0.163855 | -0.060128 | -0.023579 | -0.153979 | 0.374943 | -0.118882 | 0.312728 | -0.125367 |
| total_credit_lines | 0.364356 | 0.019603 | 0.773013 | 0.456368 | 0.200961 | 1.000000 | 0.003815 | 0.201582 | 0.035836 | -0.155072 | 0.453808 | 0.383270 | 0.290106 | -0.167891 | 0.197264 | 0.295645 | 0.179246 | -0.060062 | 0.362740 | -0.057359 |
| state | -0.006579 | 0.021522 | 0.003598 | 0.016815 | -0.034401 | 0.003815 | 1.000000 | -0.044787 | -0.019213 | -0.014255 | 0.011787 | -0.039673 | -0.016179 | -0.015774 | 0.056561 | 0.049942 | -0.016668 | 0.024532 | -0.060530 | 0.029992 |
| annual_income | 0.151653 | -0.005635 | 0.117078 | 0.119490 | 0.911938 | 0.201582 | -0.044787 | 1.000000 | -0.016519 | -0.070764 | 0.088930 | 0.516444 | 0.151707 | -0.067587 | -0.173968 | -0.165277 | 0.321557 | -0.097158 | 0.319834 | -0.141721 |
| account_never_delinq_percent | 0.113909 | -0.012790 | 0.050700 | 0.090927 | -0.008154 | 0.035836 | -0.019213 | -0.016519 | 1.000000 | 0.642141 | 0.066066 | 0.025702 | -0.099450 | 0.016169 | 0.073148 | 0.102830 | 0.073430 | -0.003650 | 0.212903 | -0.015848 |
| months_since_last_delinq | 0.032277 | -0.022492 | -0.112417 | -0.024260 | -0.062871 | -0.155072 | -0.014255 | -0.070764 | 0.642141 | 1.000000 | -0.062352 | -0.067870 | -0.139494 | 0.068196 | 0.031431 | 0.027803 | 0.029410 | 0.033790 | 0.138354 | -0.011943 |
| accounts_opened_24m | 0.270173 | 0.032983 | 0.433478 | 0.351856 | 0.073402 | 0.453808 | 0.011787 | 0.088930 | 0.066066 | -0.062352 | 1.000000 | 0.179257 | -0.010408 | -0.314254 | 0.075838 | 0.128460 | 0.020980 | -0.067215 | 0.126735 | 0.027073 |
| total_credit_limit | 0.175165 | 0.010004 | 0.226036 | 0.166626 | 0.541086 | 0.383270 | -0.039673 | 0.516444 | 0.025702 | -0.067870 | 0.179257 | 1.000000 | 0.225628 | -0.125741 | 0.072539 | 0.071659 | 0.315199 | -0.456295 | 0.417189 | -0.107091 |
| vintage_earliest_credit_line | 0.128693 | 0.030755 | 0.311603 | 0.170033 | 0.163855 | 0.290106 | -0.016179 | 0.151707 | -0.099450 | -0.139494 | -0.010408 | 0.225628 | 1.000000 | 0.007849 | 0.092769 | 0.115823 | 0.114667 | -0.134019 | 0.234138 | 0.027901 |
| months_since_last_credit_inquiry | -0.063769 | -0.017703 | -0.131354 | -0.087971 | -0.060128 | -0.167891 | -0.015774 | -0.067587 | 0.016169 | 0.068196 | -0.314254 | -0.125741 | 0.007849 | 1.000000 | -0.047192 | -0.060301 | -0.011820 | 0.062238 | -0.022721 | -0.012564 |
| debt_to_income | 0.142747 | 0.050849 | 0.131532 | 0.189834 | -0.023579 | 0.197264 | 0.056561 | -0.173968 | 0.073148 | 0.031431 | 0.075838 | 0.072539 | 0.092769 | -0.047192 | 1.000000 | 0.631547 | 0.057199 | 0.124996 | 0.060087 | 0.059370 |
| debt_to_income_min | 0.224159 | 0.042319 | 0.189281 | 0.285010 | -0.153979 | 0.295645 | 0.049942 | -0.165277 | 0.102830 | 0.027803 | 0.128460 | 0.071659 | 0.115823 | -0.060301 | 0.631547 | 1.000000 | 0.002186 | 0.220427 | 0.103016 | 0.068597 |
| loan_amount | 0.161300 | 0.084673 | 0.146379 | 0.128768 | 0.374943 | 0.179246 | -0.016668 | 0.321557 | 0.073430 | 0.029410 | 0.020980 | 0.315199 | 0.114667 | -0.011820 | 0.057199 | 0.002186 | 1.000000 | -0.097273 | 0.322096 | -0.073212 |
| utilization_ratio | -0.025012 | -0.013562 | -0.171946 | -0.018738 | -0.118882 | -0.060062 | 0.024532 | -0.097158 | -0.003650 | 0.033790 | -0.067215 | -0.456295 | -0.134019 | 0.062238 | 0.124996 | 0.220427 | -0.097273 | 1.000000 | -0.225936 | 0.011521 |
| total_debit_limit | 0.521882 | -0.048456 | 0.440860 | 0.362143 | 0.312728 | 0.362740 | -0.060530 | 0.319834 | 0.212903 | 0.138354 | 0.126735 | 0.417189 | 0.234138 | -0.022721 | 0.060087 | 0.103016 | 0.322096 | -0.225936 | 1.000000 | -0.118519 |
| emp_title | -0.027011 | 0.044581 | 0.002168 | 0.015063 | -0.125367 | -0.057359 | 0.029992 | -0.141721 | -0.015848 | -0.011943 | 0.027073 | -0.107091 | 0.027901 | -0.012564 | 0.059370 | 0.068597 | -0.073212 | 0.011521 | -0.118519 | 1.000000 |
features_new = ['num_active_debit_accounts',
'account_never_delinq_percent',
'loan_purpose',
'annual_income',
'accounts_opened_24m',
'months_since_last_credit_inquiry',
'state',
'vintage_earliest_credit_line',
'debt_to_income',
'loan_amount',
'utilization_ratio',
'emp_title'
]
train[features_new].corr()
| num_active_debit_accounts | account_never_delinq_percent | loan_purpose | annual_income | accounts_opened_24m | months_since_last_credit_inquiry | state | vintage_earliest_credit_line | debt_to_income | loan_amount | utilization_ratio | emp_title | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| num_active_debit_accounts | 1.000000 | 0.113909 | -0.019688 | 0.151653 | 0.270173 | -0.063769 | -0.006579 | 0.128693 | 0.142747 | 0.161300 | -0.025012 | -0.027011 |
| account_never_delinq_percent | 0.113909 | 1.000000 | -0.012790 | -0.016519 | 0.066066 | 0.016169 | -0.019213 | -0.099450 | 0.073148 | 0.073430 | -0.003650 | -0.015848 |
| loan_purpose | -0.019688 | -0.012790 | 1.000000 | -0.005635 | 0.032983 | -0.017703 | 0.021522 | 0.030755 | 0.050849 | 0.084673 | -0.013562 | 0.044581 |
| annual_income | 0.151653 | -0.016519 | -0.005635 | 1.000000 | 0.088930 | -0.067587 | -0.044787 | 0.151707 | -0.173968 | 0.321557 | -0.097158 | -0.141721 |
| accounts_opened_24m | 0.270173 | 0.066066 | 0.032983 | 0.088930 | 1.000000 | -0.314254 | 0.011787 | -0.010408 | 0.075838 | 0.020980 | -0.067215 | 0.027073 |
| months_since_last_credit_inquiry | -0.063769 | 0.016169 | -0.017703 | -0.067587 | -0.314254 | 1.000000 | -0.015774 | 0.007849 | -0.047192 | -0.011820 | 0.062238 | -0.012564 |
| state | -0.006579 | -0.019213 | 0.021522 | -0.044787 | 0.011787 | -0.015774 | 1.000000 | -0.016179 | 0.056561 | -0.016668 | 0.024532 | 0.029992 |
| vintage_earliest_credit_line | 0.128693 | -0.099450 | 0.030755 | 0.151707 | -0.010408 | 0.007849 | -0.016179 | 1.000000 | 0.092769 | 0.114667 | -0.134019 | 0.027901 |
| debt_to_income | 0.142747 | 0.073148 | 0.050849 | -0.173968 | 0.075838 | -0.047192 | 0.056561 | 0.092769 | 1.000000 | 0.057199 | 0.124996 | 0.059370 |
| loan_amount | 0.161300 | 0.073430 | 0.084673 | 0.321557 | 0.020980 | -0.011820 | -0.016668 | 0.114667 | 0.057199 | 1.000000 | -0.097273 | -0.073212 |
| utilization_ratio | -0.025012 | -0.003650 | -0.013562 | -0.097158 | -0.067215 | 0.062238 | 0.024532 | -0.134019 | 0.124996 | -0.097273 | 1.000000 | 0.011521 |
| emp_title | -0.027011 | -0.015848 | 0.044581 | -0.141721 | 0.027073 | -0.012564 | 0.029992 | 0.027901 | 0.059370 | -0.073212 | 0.011521 | 1.000000 |
fig = px.box(train, height=500, y=['num_active_debit_accounts', 'accounts_opened_24m', 'vintage_earliest_credit_line', 'months_since_last_credit_inquiry'])
fig2 = px.box(train, height=500, width=1000, y=['debt_to_income', 'account_never_delinq_percent'])
fig3 = px.box(train, height=500, width=1000, y=['annual_income', 'loan_amount'])
fig4 = px.box(train, height=500, width=1000, y=['loan_purpose','state', 'emp_title' ])
fig5 = px.box(train, height=500, width=600, y=['utilization_ratio'])
fig.update_yaxes(range=[0, 40])
fig.show()
fig2.update_yaxes(range=[0, 200])
fig2.show()
fig3.update_yaxes(range=[0, 200000])
fig3.show()
fig4.show()
fig5.show()
features_new = ['num_active_debit_accounts',
'account_never_delinq_percent',
'loan_purpose',
'annual_income',
'accounts_opened_24m',
'months_since_last_credit_inquiry',
'state',
'vintage_earliest_credit_line',
'debt_to_income',
'loan_amount',
'utilization_ratio',
'emp_title'
]
from sklearn.preprocessing import MinMaxScaler
train1 = MinMaxScaler().fit_transform(train[features_new+['interest_rate']])
def print_vif(train1,cols):
from statsmodels.stats.outliers_influence import variance_inflation_factor
X_train1 = pd.DataFrame(train1,columns=cols)
vif = [variance_inflation_factor(X_train1.values, i) for i in range(X_train1.shape[1])]
vif_s =pd.Series(vif, index =X_train1.columns)
print(vif_s)
print_vif(train1[:,:12],features_new)
num_active_debit_accounts 3.723367 account_never_delinq_percent 22.613346 loan_purpose 3.332340 annual_income 2.914600 accounts_opened_24m 3.551612 months_since_last_credit_inquiry 1.296747 state 3.372590 vintage_earliest_credit_line 4.188374 debt_to_income 2.929042 loan_amount 3.770346 utilization_ratio 3.327675 emp_title 12.425064 dtype: float64
features_new.remove('account_never_delinq_percent')
train1 = np.delete(train1,1,axis=1)
print_vif(train1[:,:11],features_new)
num_active_debit_accounts 3.633913 loan_purpose 3.231298 annual_income 2.832511 accounts_opened_24m 3.402479 months_since_last_credit_inquiry 1.263982 state 3.218553 vintage_earliest_credit_line 4.095795 debt_to_income 2.891732 loan_amount 3.626323 utilization_ratio 3.083839 emp_title 8.178301 dtype: float64
features_new.remove('emp_title')
train1 = np.delete(train1,10,axis=1)
print_vif(train1[:,:10],features_new)
num_active_debit_accounts 3.611555 loan_purpose 3.039953 annual_income 2.830628 accounts_opened_24m 3.212519 months_since_last_credit_inquiry 1.242758 state 2.998644 vintage_earliest_credit_line 3.759728 debt_to_income 2.850160 loan_amount 3.580365 utilization_ratio 2.815627 dtype: float64
def run_lin_regression(y,x):
import statsmodels.api as sm
model = sm.OLS(y, x)
#Statsmodels has a fit_regularized function which uses regularization,
#but the summary function has not been implemented yet.
#Using the normal fit function here for the sake of explainability.
fit_model = model.fit()
summary = fit_model.summary()
print(summary)
return fit_model
model = run_lin_regression(train1[:, -1], train1[:,:10])
OLS Regression Results
=======================================================================================
Dep. Variable: y R-squared (uncentered): 0.706
Model: OLS Adj. R-squared (uncentered): 0.706
Method: Least Squares F-statistic: 1800.
Date: Tue, 02 Nov 2021 Prob (F-statistic): 0.00
Time: 04:14:01 Log-Likelihood: 2045.8
No. Observations: 7500 AIC: -4072.
Df Residuals: 7490 BIC: -4002.
Df Model: 10
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
x1 0.0483 0.030 1.621 0.105 -0.010 0.107
x2 0.0784 0.005 16.043 0.000 0.069 0.088
x3 -0.4311 0.079 -5.423 0.000 -0.587 -0.275
x4 0.2602 0.018 14.130 0.000 0.224 0.296
x5 -0.0310 0.007 -4.649 0.000 -0.044 -0.018
x6 0.0897 0.008 11.151 0.000 0.074 0.105
x7 -0.0190 0.014 -1.392 0.164 -0.046 0.008
x8 0.5840 0.068 8.557 0.000 0.450 0.718
x9 0.1146 0.008 13.520 0.000 0.098 0.131
x10 0.3763 0.012 30.784 0.000 0.352 0.400
==============================================================================
Omnibus: 673.269 Durbin-Watson: 2.023
Prob(Omnibus): 0.000 Jarque-Bera (JB): 888.845
Skew: 0.766 Prob(JB): 9.77e-194
Kurtosis: 3.707 Cond. No. 39.5
==============================================================================
Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
features_new.remove('num_active_debit_accounts')
features_new.remove('vintage_earliest_credit_line')
train1 = np.delete(train1,6,axis=1)
train1 = np.delete(train1,0,axis=1)
model = run_lin_regression(train1[:, -1], train1[:,:8])
OLS Regression Results
=======================================================================================
Dep. Variable: y R-squared (uncentered): 0.706
Model: OLS Adj. R-squared (uncentered): 0.706
Method: Least Squares F-statistic: 2248.
Date: Tue, 02 Nov 2021 Prob (F-statistic): 0.00
Time: 04:14:11 Log-Likelihood: 2043.9
No. Observations: 7500 AIC: -4072.
Df Residuals: 7492 BIC: -4016.
Df Model: 8
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
x1 0.0775 0.005 16.065 0.000 0.068 0.087
x2 -0.4362 0.076 -5.706 0.000 -0.586 -0.286
x3 0.2673 0.017 15.435 0.000 0.233 0.301
x4 -0.0312 0.007 -4.719 0.000 -0.044 -0.018
x5 0.0889 0.008 11.196 0.000 0.073 0.105
x6 0.5842 0.066 8.887 0.000 0.455 0.713
x7 0.1152 0.008 13.880 0.000 0.099 0.131
x8 0.3773 0.012 30.977 0.000 0.353 0.401
==============================================================================
Omnibus: 673.597 Durbin-Watson: 2.023
Prob(Omnibus): 0.000 Jarque-Bera (JB): 889.128
Skew: 0.766 Prob(JB): 8.48e-194
Kurtosis: 3.706 Cond. No. 35.8
==============================================================================
Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
test1 = MinMaxScaler().fit_transform(test[features_new+['interest_rate']])
train_pred = model.predict(train1[:,:8])
test_pred = model.predict(test1[:,:8])
from statsmodels.tools.eval_measures import mse
print("MSE of training data:")
print(mse(train1[:,-1],train_pred))
print("MSE of test data:")
print(mse(test1[:,-1],test_pred))
MSE of training data: 0.033948702571833675 MSE of test data: 0.03368352792118426
features_nn = ['num_active_debit_accounts',
'loan_purpose',
'annual_income',
'accounts_opened_24m',
'months_since_last_credit_inquiry',
'state',
'vintage_earliest_credit_line',
'debt_to_income',
'loan_amount',
'utilization_ratio'
]
train1 = MinMaxScaler().fit_transform(train[features_nn+['interest_rate']])
test1 = MinMaxScaler().fit_transform(test[features_nn+['interest_rate']])
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
import matplotlib.pyplot as plt
def run_ann(x,y):
model = Sequential([
Dense(10, activation=tf.nn.relu, input_shape=(10,)),
Dense(10, activation=tf.nn.relu),
Dense(1)
])
opt = tf.keras.optimizers.Adam(learning_rate=0.0001)
model.compile(loss='mean_squared_error',optimizer=opt, metrics=['mse', 'mae'])
fit_model = model.fit(x,y,epochs=50, validation_split=0.2)
history_df = pd.DataFrame(fit_model.history)
#Plotting the training and validation loss across epochs
plt.plot(history_df['loss'], label='loss')
plt.plot(history_df['val_loss'], label='val_loss')
plt.legend()
plt.show()
return model
model = run_ann(train1[:,:10], train1[:, -1])
Epoch 1/50 188/188 [==============================] - 1s 1ms/step - loss: 0.1447 - mse: 0.1447 - mae: 0.3209 - val_loss: 0.1240 - val_mse: 0.1240 - val_mae: 0.2856 Epoch 2/50 188/188 [==============================] - 0s 825us/step - loss: 0.0904 - mse: 0.0904 - mae: 0.2351 - val_loss: 0.0748 - val_mse: 0.0748 - val_mae: 0.2074 Epoch 3/50 188/188 [==============================] - 0s 1ms/step - loss: 0.0558 - mse: 0.0558 - mae: 0.1797 - val_loss: 0.0536 - val_mse: 0.0536 - val_mae: 0.1778 Epoch 4/50 188/188 [==============================] - 0s 852us/step - loss: 0.0451 - mse: 0.0451 - mae: 0.1651 - val_loss: 0.0477 - val_mse: 0.0477 - val_mae: 0.1705 Epoch 5/50 188/188 [==============================] - 0s 770us/step - loss: 0.0416 - mse: 0.0416 - mae: 0.1602 - val_loss: 0.0451 - val_mse: 0.0451 - val_mae: 0.1668 Epoch 6/50 188/188 [==============================] - 0s 760us/step - loss: 0.0398 - mse: 0.0398 - mae: 0.1569 - val_loss: 0.0435 - val_mse: 0.0435 - val_mae: 0.1641 Epoch 7/50 188/188 [==============================] - 0s 798us/step - loss: 0.0386 - mse: 0.0386 - mae: 0.1545 - val_loss: 0.0423 - val_mse: 0.0423 - val_mae: 0.1618 Epoch 8/50 188/188 [==============================] - 0s 904us/step - loss: 0.0377 - mse: 0.0377 - mae: 0.1526 - val_loss: 0.0413 - val_mse: 0.0413 - val_mae: 0.1599 Epoch 9/50 188/188 [==============================] - 0s 874us/step - loss: 0.0370 - mse: 0.0370 - mae: 0.1510 - val_loss: 0.0405 - val_mse: 0.0405 - val_mae: 0.1583 Epoch 10/50 188/188 [==============================] - 0s 871us/step - loss: 0.0364 - mse: 0.0364 - mae: 0.1499 - val_loss: 0.0399 - val_mse: 0.0399 - val_mae: 0.1570 Epoch 11/50 188/188 [==============================] - 0s 892us/step - loss: 0.0359 - mse: 0.0359 - mae: 0.1488 - val_loss: 0.0394 - val_mse: 0.0394 - val_mae: 0.1562 Epoch 12/50 188/188 [==============================] - 0s 835us/step - loss: 0.0356 - mse: 0.0356 - mae: 0.1481 - val_loss: 0.0390 - val_mse: 0.0390 - val_mae: 0.1557 Epoch 13/50 188/188 [==============================] - 0s 779us/step - loss: 0.0353 - mse: 0.0353 - mae: 0.1476 - val_loss: 0.0387 - val_mse: 0.0387 - val_mae: 0.1551 Epoch 14/50 188/188 [==============================] - 0s 939us/step - loss: 0.0350 - mse: 0.0350 - mae: 0.1469 - val_loss: 0.0384 - val_mse: 0.0384 - val_mae: 0.1549 Epoch 15/50 188/188 [==============================] - 0s 793us/step - loss: 0.0348 - mse: 0.0348 - mae: 0.1468 - val_loss: 0.0383 - val_mse: 0.0383 - val_mae: 0.1539 Epoch 16/50 188/188 [==============================] - 0s 810us/step - loss: 0.0346 - mse: 0.0346 - mae: 0.1460 - val_loss: 0.0381 - val_mse: 0.0381 - val_mae: 0.1540 Epoch 17/50 188/188 [==============================] - 0s 767us/step - loss: 0.0344 - mse: 0.0344 - mae: 0.1458 - val_loss: 0.0379 - val_mse: 0.0379 - val_mae: 0.1537 Epoch 18/50 188/188 [==============================] - 0s 779us/step - loss: 0.0343 - mse: 0.0343 - mae: 0.1455 - val_loss: 0.0378 - val_mse: 0.0378 - val_mae: 0.1532 Epoch 19/50 188/188 [==============================] - 0s 799us/step - loss: 0.0341 - mse: 0.0341 - mae: 0.1450 - val_loss: 0.0376 - val_mse: 0.0376 - val_mae: 0.1534 Epoch 20/50 188/188 [==============================] - 0s 769us/step - loss: 0.0340 - mse: 0.0340 - mae: 0.1450 - val_loss: 0.0376 - val_mse: 0.0376 - val_mae: 0.1526 Epoch 21/50 188/188 [==============================] - 0s 769us/step - loss: 0.0339 - mse: 0.0339 - mae: 0.1446 - val_loss: 0.0374 - val_mse: 0.0374 - val_mae: 0.1523 Epoch 22/50 188/188 [==============================] - 0s 769us/step - loss: 0.0338 - mse: 0.0338 - mae: 0.1440 - val_loss: 0.0373 - val_mse: 0.0373 - val_mae: 0.1528 Epoch 23/50 188/188 [==============================] - 0s 812us/step - loss: 0.0337 - mse: 0.0337 - mae: 0.1443 - val_loss: 0.0373 - val_mse: 0.0373 - val_mae: 0.1518 Epoch 24/50 188/188 [==============================] - 0s 876us/step - loss: 0.0336 - mse: 0.0336 - mae: 0.1438 - val_loss: 0.0372 - val_mse: 0.0372 - val_mae: 0.1520 Epoch 25/50 188/188 [==============================] - 0s 907us/step - loss: 0.0335 - mse: 0.0335 - mae: 0.1436 - val_loss: 0.0371 - val_mse: 0.0371 - val_mae: 0.1517 Epoch 26/50 188/188 [==============================] - 0s 1000us/step - loss: 0.0334 - mse: 0.0334 - mae: 0.1436 - val_loss: 0.0371 - val_mse: 0.0371 - val_mae: 0.1513 Epoch 27/50 188/188 [==============================] - 0s 846us/step - loss: 0.0333 - mse: 0.0333 - mae: 0.1431 - val_loss: 0.0369 - val_mse: 0.0369 - val_mae: 0.1520 Epoch 28/50 188/188 [==============================] - 0s 848us/step - loss: 0.0333 - mse: 0.0333 - mae: 0.1431 - val_loss: 0.0369 - val_mse: 0.0369 - val_mae: 0.1515 Epoch 29/50 188/188 [==============================] - 0s 837us/step - loss: 0.0332 - mse: 0.0332 - mae: 0.1430 - val_loss: 0.0369 - val_mse: 0.0369 - val_mae: 0.1510 Epoch 30/50 188/188 [==============================] - 0s 782us/step - loss: 0.0331 - mse: 0.0331 - mae: 0.1424 - val_loss: 0.0367 - val_mse: 0.0367 - val_mae: 0.1522 Epoch 31/50 188/188 [==============================] - 0s 795us/step - loss: 0.0331 - mse: 0.0331 - mae: 0.1428 - val_loss: 0.0367 - val_mse: 0.0367 - val_mae: 0.1513 Epoch 32/50 188/188 [==============================] - 0s 782us/step - loss: 0.0331 - mse: 0.0331 - mae: 0.1425 - val_loss: 0.0367 - val_mse: 0.0367 - val_mae: 0.1513 Epoch 33/50 188/188 [==============================] - 0s 794us/step - loss: 0.0330 - mse: 0.0330 - mae: 0.1426 - val_loss: 0.0367 - val_mse: 0.0367 - val_mae: 0.1511 Epoch 34/50 188/188 [==============================] - 0s 778us/step - loss: 0.0330 - mse: 0.0330 - mae: 0.1422 - val_loss: 0.0366 - val_mse: 0.0366 - val_mae: 0.1508 Epoch 35/50 188/188 [==============================] - 0s 806us/step - loss: 0.0329 - mse: 0.0329 - mae: 0.1422 - val_loss: 0.0367 - val_mse: 0.0367 - val_mae: 0.1504 Epoch 36/50 188/188 [==============================] - 0s 761us/step - loss: 0.0329 - mse: 0.0329 - mae: 0.1421 - val_loss: 0.0365 - val_mse: 0.0365 - val_mae: 0.1508 Epoch 37/50 188/188 [==============================] - 0s 760us/step - loss: 0.0329 - mse: 0.0329 - mae: 0.1421 - val_loss: 0.0365 - val_mse: 0.0365 - val_mae: 0.1510 Epoch 38/50 188/188 [==============================] - 0s 739us/step - loss: 0.0328 - mse: 0.0328 - mae: 0.1421 - val_loss: 0.0365 - val_mse: 0.0365 - val_mae: 0.1507 Epoch 39/50 188/188 [==============================] - 0s 842us/step - loss: 0.0328 - mse: 0.0328 - mae: 0.1419 - val_loss: 0.0364 - val_mse: 0.0364 - val_mae: 0.1510 Epoch 40/50 188/188 [==============================] - 0s 769us/step - loss: 0.0328 - mse: 0.0328 - mae: 0.1418 - val_loss: 0.0364 - val_mse: 0.0364 - val_mae: 0.1510 Epoch 41/50 188/188 [==============================] - 0s 765us/step - loss: 0.0328 - mse: 0.0328 - mae: 0.1417 - val_loss: 0.0363 - val_mse: 0.0363 - val_mae: 0.1511 Epoch 42/50 188/188 [==============================] - 0s 786us/step - loss: 0.0327 - mse: 0.0327 - mae: 0.1418 - val_loss: 0.0363 - val_mse: 0.0363 - val_mae: 0.1506 Epoch 43/50 188/188 [==============================] - 0s 792us/step - loss: 0.0327 - mse: 0.0327 - mae: 0.1416 - val_loss: 0.0363 - val_mse: 0.0363 - val_mae: 0.1508 Epoch 44/50 188/188 [==============================] - 0s 760us/step - loss: 0.0327 - mse: 0.0327 - mae: 0.1417 - val_loss: 0.0363 - val_mse: 0.0363 - val_mae: 0.1507 Epoch 45/50 188/188 [==============================] - 0s 816us/step - loss: 0.0327 - mse: 0.0327 - mae: 0.1415 - val_loss: 0.0362 - val_mse: 0.0362 - val_mae: 0.1508 Epoch 46/50 188/188 [==============================] - 0s 969us/step - loss: 0.0326 - mse: 0.0326 - mae: 0.1415 - val_loss: 0.0363 - val_mse: 0.0363 - val_mae: 0.1505 Epoch 47/50 188/188 [==============================] - 0s 819us/step - loss: 0.0326 - mse: 0.0326 - mae: 0.1413 - val_loss: 0.0362 - val_mse: 0.0362 - val_mae: 0.1509 Epoch 48/50 188/188 [==============================] - 0s 840us/step - loss: 0.0326 - mse: 0.0326 - mae: 0.1415 - val_loss: 0.0362 - val_mse: 0.0362 - val_mae: 0.1504 Epoch 49/50 188/188 [==============================] - 0s 828us/step - loss: 0.0326 - mse: 0.0326 - mae: 0.1413 - val_loss: 0.0362 - val_mse: 0.0362 - val_mae: 0.1504 Epoch 50/50 188/188 [==============================] - 0s 786us/step - loss: 0.0326 - mse: 0.0326 - mae: 0.1414 - val_loss: 0.0362 - val_mse: 0.0362 - val_mae: 0.1504
model.evaluate(train1[:,:10], train1[:, -1])
model.evaluate(test1[:,:10], test1[:, -1])
235/235 [==============================] - 0s 442us/step - loss: 0.0333 - mse: 0.0333 - mae: 0.1430 79/79 [==============================] - 0s 437us/step - loss: 0.0330 - mse: 0.0330 - mae: 0.1381
[0.033040814101696014, 0.033040814101696014, 0.13808883726596832]